Skip to content
Extracting URLs of a specific target based on the results of "commoncrawl.org"
Branch: master
Clone or download
si9int Merge pull request #7 from jgor/master
Allow cc.py to run outside of project directory
Latest commit c8f11f0 Mar 10, 2019
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
LICENSE Initial commit Jul 13, 2018
README.md
cc.py use script-relative path to find index.txt Feb 14, 2019
index.txt

README.md

cc.py

Extracting URLs of a specific target based on the results of "commoncrawl.org".
Updated to v.0.3 | Whats new:

  • 65% faster proceeding
  • Specify a year via -y/--year, e.g.: -y 2018
  • Specify an output file via -o/--out, e.g.: -o whatever.txt
  • Crawl all pages for a specific index -i/--index, e.g.: cc.py army.mil -i CC-MAIN-2018-05
  • List all available indexes -l/--list, e.g.: cc.py army.mil -l

ToDo

  • Crawl for a specific index
  • Implementation of multithreading
  • Allowing a range of years as input
  • Implementing direct-grep
  • Temporary file-writing

Usage

cc.py [-h] [-y YEAR] [-o OUT] [-l] [-i INDEX] [-u] domain

positional arguments:
  domain                domain which will be crawled for

optional arguments:
  -h, --help            show this help message and exit
  -y YEAR, --year YEAR  limit the result to a specific year (default: all)
  -o OUT, --out OUT     specify an output file (default: domain.txt)
  -l, --list            Lists all available indexes
  -i INDEX, --index INDEX
                        Crawl for a specific index (this will crawl all
                        pages!)
  -u, --update          Update index file

Example

python3 cc.py github.com -y 2018 -o github_18.txt
cat github_18.txt | grep user

Dependencies

  • Python3
You can’t perform that action at this time.