
BASHkrawler

1. Description

A Bash web crawler that finds URLs by parsing the HTML source code of a given website's homepage, along with the JavaScript links found there. An optional pattern word can be passed as an argument to filter the extracted URLs.
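The URL-extraction idea the description refers to can be sketched with standard tools. This is a minimal illustration, not the script's actual code: the regex and the sample HTML snippet are assumptions, and the real tool would feed the pipeline something like `curl -s "$domain"` instead of a hard-coded page.

```shell
#!/usr/bin/env bash
# Sketch of the HTML-parsing step: pull every absolute http(s) URL
# out of a page with grep -oE. The sample snippet stands in for a
# live homepage so the sketch runs offline.

html='<a href="https://www.nasa.gov/news">News</a>
<script src="https://www.nasa.gov/assets/app.js"></script>
<a href="https://images.nasa.gov/gallery">Gallery</a>'

# Extract URLs, stopping at quotes/spaces/angle brackets, then de-duplicate.
urls=$(printf '%s\n' "$html" | grep -oE 'https?://[^"'\'' <>]+' | sort -u)
printf '%s\n' "$urls"
```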


2. Install

➜ git clone https://github.com/torsh4rk/BASHkrawler.git
➜ cd BASHkrawler/ && chmod +x bashkrawler.sh
➜ ./bashkrawler.sh

3. Example Usage

Fig.1 - Displaying the banner


3.1. HTML parsing without using a pattern word to match

Fig.2 - Choosing option 1 to find all URLs at the target domain www.nasa.gov via HTML parsing

Fig.3 - Finding all URLs at the target domain www.nasa.gov via HTML parsing


3.2. Finding all JS links at the target domain and parsing them without using a pattern word to match

Fig.4 - Choosing option 2 to find all JS links at the target domain www.nasa.gov and extract all URLs from those JS links
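The JS-link step can be sketched in two stages: collect every `.js` URL from the homepage, then parse each file for further URLs. The sample page and names below are illustrative assumptions; the real crawler fetches live pages.

```shell
#!/usr/bin/env bash
# Sketch of option 2: first collect every script URL ending in .js;
# the real crawler would then fetch and parse each one for more URLs.
# A sample page replaces a live fetch so the sketch runs offline.

page='<script src="https://cdn.nasa.gov/lib/jquery.js"></script>
<img src="https://www.nasa.gov/logo.png">
<script src="https://www.nasa.gov/assets/app.js"></script>'

js_links=$(printf '%s\n' "$page" | grep -oE 'https?://[^" ]+\.js' | sort -u)
printf '%s\n' "$js_links"

# A live crawl would continue with something like:
#   for js in $js_links; do curl -s "$js" | grep -oE 'https?://[^"'\'' <>]+'; done
```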


3.3. Full web crawling by running options 1 and 2 without using a pattern word to match

Fig.5 - Choosing option 3 to find all URLs at the target domain www.nasa.gov via options 1 and 2 without using a pattern word

Fig.6 - Finishing the full web crawl at the target domain www.nasa.gov


3.4. Web crawling using a pattern word to match

Fig.7 - Crawling a target domain to find all URLs containing the word ".nasa"

Fig.8 - Choosing option 3 to find all URLs containing the word ".nasa" at the target domain www.nasa.gov via options 1 and 2

Fig.9 - Finishing the full web crawl at the target domain www.nasa.gov using the word ".nasa"
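The pattern-word filter can be understood as a plain grep applied after extraction. The URL list and the assumption that the match is a fixed string (hence `grep -F`, so the "." in ".nasa" is literal) are illustrative, not taken from the script itself.

```shell
#!/usr/bin/env bash
# Sketch of the optional pattern-word filter: narrow the extracted
# URL list to entries containing the supplied word (".nasa", as in
# Figs. 7-9). grep -F treats the pattern as a fixed string.

urls='https://images.nasa.gov/gallery
https://cdn.example.com/lib.js
https://www.nasa.gov/news'

pattern='.nasa'
matched=$(printf '%s\n' "$urls" | grep -F "$pattern")
printf '%s\n' "$matched"
```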


4. References

https://medium.datadriveninvestor.com/what-is-a-web-crawler-and-how-does-it-work-b9e9c2e4c35d
