GitHub - tkisason/linkcrawl: Simple crawler framework, useful for creating your custom web crawlers/bots

This repository has been archived by the owner on Jul 19, 2022. It is now read-only.

tkisason / linkcrawl Public archive

Notifications You must be signed in to change notification settings
Fork 3
Star 7

Simple crawler framework, useful for creating your custom web crawlers/bots

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
README		README
enum_pastie.py		enum_pastie.py
linkcrawl.py		linkcrawl.py

Repository files navigation

LinkCrawl, this is an ongoing progress to create a single script that can be used for all matter of pentesting/research needs.
Some functionality that will be added over time is:

* Extract all links from a page (done)
* Extract all links with same origin of the page (done)
* Extract all text content from a given class (done)
* Recursive searching 
* Run custom recursive searches for specific content on various sites
* Graphviz visualization of links/content

As an example of what linkcrawl should do, there is a enum_pastie script, which is an example how you can use linkcrawl to enumerate pastie.org to find interesting pasties. An ongoing effort is to minimse the needed code for custom crawlers. 

Try some of the following queries: "123456 qwerty", "DB_PASSWORD", "phpmyadmin", "connect(", "exploit"

For example:

tony@enigma:~/2code/linkcrawl$ ./enum_pastie.py 
[+] LinkCrawl: Enumerate pastie.org, enter query: 123456 qwerty 
[+] Enter output file name: passwords
[+] Got links, dumping content to file, delimiter is: EOF EOF EOF
[+] Working: .............................................................................

[+] Enumeration done, output is in: passwords.html

After that, open the passwords.html in a web browser and enjoy :) More stuff will be added later on as i have time.


This script requires LXML (python-lxml) library. Make sure you install it before using the script. 

Don't expect an extensive GUI or CLI implementation, this is mostly a header, so you can prototype your custom searchers/parsers/crawlers faster, CLI parameters will be minimal at best, i like to use this with ipython. 

Usage is simple : ./linkcrawl.py [OPTION] {,http://}some.web.site

Options:

	-o: print only those that have the same origin as the requested site