Corpus

Description

Corpus is an asynchronous web crawler for collecting a set of sample files. You can then run afl-cmin over them to create a minimized set (minset) for later use with AFL.
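The minimization step could be scripted along the following lines. This is only a sketch: the `test` and `minset` directory names and the `./pdf_target` binary are hypothetical placeholders, and the helper functions are not part of Corpus. It relies on afl-cmin's `-i` (input directory) and `-o` (output directory) options, with the target command given after `--` and `@@` standing for the input file path.

```python
import shutil
import subprocess


def build_afl_cmin_command(corpus_dir, minset_dir, target_cmd):
    """Return the argv list for an afl-cmin run.

    corpus_dir: directory holding the crawled sample files
    minset_dir: directory that will receive the minimized set
    target_cmd: command line of the instrumented target; "@@" is
                replaced by AFL with the path of each input file
    """
    return ["afl-cmin", "-i", str(corpus_dir), "-o", str(minset_dir), "--"] + list(target_cmd)


def minimize(corpus_dir, minset_dir, target_cmd):
    """Run afl-cmin, failing early if it is not installed."""
    if shutil.which("afl-cmin") is None:
        raise RuntimeError("afl-cmin not found on PATH")
    subprocess.run(build_afl_cmin_command(corpus_dir, minset_dir, target_cmd), check=True)


if __name__ == "__main__":
    # "./pdf_target" is a hypothetical AFL-instrumented binary.
    print(" ".join(build_afl_cmin_command("test", "minset", ["./pdf_target", "@@"])))
```

Building the argument list separately from running it makes it easy to inspect the command before handing a large corpus to afl-cmin.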

Setup

Corpus is implemented with the asyncio module as shipped in Python 3.5, so you need Python >= 3.5.0.
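For readers unfamiliar with asyncio, here is a minimal sketch of the general pattern an asynchronous crawler relies on: many downloads in flight at once, capped by a task limit. It is not Corpus's actual code. The `fetch` and `crawl` names are hypothetical, the network request is simulated with `asyncio.sleep`, and `asyncio.run` needs Python 3.7 or later (on 3.5 you would use `loop.run_until_complete` instead).

```python
import asyncio


async def fetch(url, semaphore):
    """Pretend to download `url` and return the file name it points to."""
    async with semaphore:          # at most max_tasks fetches run at once
        await asyncio.sleep(0.01)  # stand-in for a real HTTP request
        return url.rsplit("/", 1)[-1]


async def crawl(urls, max_tasks=5):
    """Fetch all URLs concurrently, bounded by max_tasks."""
    semaphore = asyncio.Semaphore(max_tasks)
    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(fetch(url, semaphore) for url in urls))


if __name__ == "__main__":
    # Placeholder URLs; no network access happens in this sketch.
    samples = ["http://a.example/x.pdf", "http://b.example/y.pdf"]
    print(asyncio.run(crawl(samples, max_tasks=2)))
```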

Pre-requisites

virtualenvwrapper>=4.7
$ pip install virtualenvwrapper

Virtualenv configuration is left to the discretion of the user. Once you're set up, go on to the next steps.

Installation

Clone the source, then create a virtualenv for the Corpus app as follows:

$ cd corpus
$ mkvirtualenv -p python3 -r requirements.txt corpus

Now you are ready to use it.

Usage

$ workon corpus
(corpus) $ ./corpus.py
usage: corpus.py --roots [ROOT_DOMAINS [ROOT_DOMAINS ...]] --file_type
                 FILE_TYPE -o OUT_DIR [-i] [--select] [-r MAX_REDIRECT]
                 [-t MAX_TRIES] [-c MAX_TASKS] [-e REGEX] [-s] [-v] [-q]
                 [-m MAX_SIZE]
corpus.py: error: the following arguments are required: --roots, --file_type, -o/--output
(corpus) $
(corpus) $ ./corpus.py --roots www.adobe.com --file_type pdf -o test
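The usage message above lists `--roots`, `--file_type` and `-o/--output` as required. A parser producing a similar interface might look like the sketch below. It is an illustration, not the contents of `corpus.py`: the `dest` names and the lack of defaults are assumptions, and only a few of the optional flags are shown.

```python
import argparse


def build_parser():
    """Build a parser resembling the usage message (assumed, not the real one)."""
    parser = argparse.ArgumentParser(prog="corpus.py")
    # Required arguments, as reported by the error message.
    parser.add_argument("--roots", nargs="*", required=True, metavar="ROOT_DOMAINS")
    parser.add_argument("--file_type", required=True)
    parser.add_argument("-o", "--output", required=True, metavar="OUT_DIR")
    # Optional numeric limits; dest names here are hypothetical.
    parser.add_argument("-r", type=int, dest="max_redirect", metavar="MAX_REDIRECT")
    parser.add_argument("-t", type=int, dest="max_tries", metavar="MAX_TRIES")
    parser.add_argument("-c", type=int, dest="max_tasks", metavar="MAX_TASKS")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--roots", "www.adobe.com", "--file_type", "pdf", "-o", "test"]
    )
    print(args.roots, args.file_type, args.output)
```

Note that the flag is spelled `--file_type`, with an underscore, in the usage message.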

thelumberjhack/corpusgen
