PdfRep Dataset

Overview

The PdfRep dataset is a collection of PDF files compiled from several sources to support research in areas such as malware analysis, document classification, and cybersecurity. It is intended for research use. If you use this dataset, please cite our work:

R. Liu, R. Joyce, C. Matuszek and C. Nicholas, "Evaluating Representativeness in PDF Malware Datasets: A Comparative Study and a New Dataset," 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 2023, pp. 3017-3024, doi: 10.1109/BigData59044.2023.10386516.

Data Sources

The PdfRep dataset is an amalgamation of files from four distinct sources:

Contagio Dataset: A well-known repository of malware samples. Accessible at Contagio Blogspot.
CIC Dataset: A dataset that includes a variety of malicious PDF files. Available for download on the CIC Dataset page.
VirusShare: A collection of malicious files. In our experience, adding this collection can significantly improve the performance of trained models. Available at VirusShare Data.
Govdocs: A dataset of benign files hosted by Digital Corpora. These files can be found at Digital Corpora.

Feature File: In addition to the four sources above, the extracted features can be downloaded from Feature Data.

Dataset Structure

The dataset includes a mix of benign and malicious PDF files, providing a diverse range of samples for analysis.

File References

To look up individual files, consult the filename column in feature_file.csv. This column lists the filename of every file included in the PdfRep dataset. The same CSV file also contains the corresponding features used in this research.
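As an illustration of how the filename column might be used, here is a minimal sketch that reads feature_file.csv with the Python standard library and checks which referenced files are present locally. Only the filename column is documented above. The helper names load_filenames and find_missing are hypothetical, and the sketch assumes the PDFs sit in a single flat directory under the names given in that column.

```python
import csv
from pathlib import Path


def load_filenames(csv_path):
    """Return the values of the 'filename' column from the feature CSV."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Only the 'filename' column is documented; fail early if it is absent.
        if reader.fieldnames is None or "filename" not in reader.fieldnames:
            raise ValueError("expected a 'filename' column in the CSV header")
        return [row["filename"] for row in reader]


def find_missing(filenames, pdf_dir):
    """List referenced files that are not present in a local directory.

    Assumption: files are stored flat in pdf_dir under the listed names.
    """
    root = Path(pdf_dir)
    return [name for name in filenames if not (root / name).is_file()]
```

A typical call would be find_missing(load_filenames("feature_file.csv"), "pdfs/"), where "pdfs/" stands for whatever local directory holds the downloaded files.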

Usage

This dataset is intended for use in academic and research settings. If you encounter an error while using the pdfrw library to extract features, try this modified version instead: https://github.com/mzweilin/PDF-Malware-Parser
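One way to find out which files need the modified parser is to record every file on which parsing raises an error, then re-run only those. The sketch below is a generic pattern and is not taken from extract_feature.py. The function name split_by_parse_success is hypothetical, and the parse argument is any callable you supply that takes a file path, for example pdfrw's PdfReader.

```python
def split_by_parse_success(paths, parse):
    """Apply parse to each path and separate successes from failures.

    parse is a caller-supplied callable (an assumption of this sketch),
    e.g. pdfrw.PdfReader. Returns (parsed, failed), where parsed maps
    path -> result and failed is a list of (path, error description).
    """
    parsed = {}
    failed = []
    for path in paths:
        try:
            parsed[path] = parse(path)
        except Exception as exc:  # malformed PDFs can fail in many ways
            failed.append((path, repr(exc)))
    return parsed, failed
```

The paths collected in failed can then be processed again in an environment where the modified parser is installed.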

Acknowledgments

We acknowledge the contributions of the respective organizations and repositories that have made their data available, aiding in the creation of this comprehensive dataset.

Contact

For any inquiries or further information regarding the PdfRep dataset, please feel free to contact us.

