A simple Java (1.6) crawler for crawling web pages within a single domain. If a page redirects to another domain, that page is not picked up, unless it is the first URL tested. Basically, you can do this:
- Crawl from a start point, defining the depth of the crawl and optionally restricting the crawl to a specific path
- Output all working URLs
- Output the data to a CSV file, separated into working (200 response code) and non-working URLs
- Output the data to two text files, one with working URLs and one with non-working URLs. Each URL is written on its own line.
- Output URLs that contain a keyword in the HTML
- Experimental support for verifying that assets on a page work
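The single-domain rule can be illustrated with a small host comparison. This is only a sketch of the idea, not the crawler's own code: the class and method names are hypothetical, and it assumes that "same domain" means an exact, case-insensitive host match (so `www.soulislove.com` and `soulislove.com` would count as different hosts).

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of the crawler's API.
final class SameDomainCheck {

    // Returns true when both URLs parse and have the same host (case-insensitive).
    // Assumption: an exact host match is what "same domain" means here.
    static boolean isSameDomain(String startUrl, String candidateUrl) {
        try {
            String startHost = new URI(startUrl).getHost();
            String candidateHost = new URI(candidateUrl).getHost();
            // equalsIgnoreCase(null) is false, so a candidate without a host is rejected.
            return startHost != null && startHost.equalsIgnoreCase(candidateHost);
        } catch (URISyntaxException e) {
            // Unparseable URLs are treated as out of scope.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isSameDomain("http://soulislove.com/", "http://soulislove.com/tagg/"));
        System.out.println(isSameDomain("http://soulislove.com/", "http://www.peterhedenskog.com/"));
    }
}
```

A real crawler would also have to decide how to treat subdomains and redirects; the sketch leaves both out.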
A simple crawl has the following options and outputs the crawled URLs to standard out. Note that by default only URLs that return 200 are output:
usage: CrawlToSystemOut [-l ] [-np ] [-p ] -u [-v ]
 -l,--level             how deep the crawl should be done, default is 1 [optional]
 -np,--notFollowPath    no url:s on this path will be crawled [optional]
 -p,--followPath        stay on this path when crawling [optional]
 -u,--url               the page that is the startpoint of the crawl, example http://mydomain.com/mypage
 -v,--verify            verify that all links are returning 200, default is set to true [optional]
 -rh,--requestHeaders   the request headers by the form of header1:value1@header2:value2 [optional]
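The `--requestHeaders` value packs several headers into one string of the form `header1:value1@header2:value2`. As a rough illustration of how such a string could be split, here is a sketch. The class name is hypothetical and this is not the crawler's actual parser. It assumes that each header is split at its first colon, so values may contain colons but not `@`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the crawler's API.
final class RequestHeaderParser {

    // Parses "header1:value1@header2:value2" into an ordered name -> value map.
    static Map<String, String> parse(String raw) {
        Map<String, String> headers = new LinkedHashMap<String, String>();
        if (raw == null || raw.trim().isEmpty()) {
            return headers;
        }
        for (String pair : raw.split("@")) {
            // Split at the first colon only, so a value such as a URL keeps its colons.
            int colon = pair.indexOf(':');
            if (colon <= 0) {
                continue; // no name, or no colon at all: skip the entry
            }
            String name = pair.substring(0, colon).trim();
            String value = pair.substring(colon + 1).trim();
            if (!name.isEmpty()) {
                headers.put(name, value);
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        System.out.println(parse("header1:value1@header2:value2"));
    }
}
```

Because `@` is the separator, a header value that itself contains `@` cannot be expressed in this format under the assumptions above.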
You can choose to output the crawled list to two plain-text files, one with working URLs and one with non-working URLs:
usage: CrawlToFile [-ef ] [-f ] [-l ] [-np ] [-p ] -u [-v ] [-ve ]
 -ef,--errorfilename    the name of the error output file, default name is errorurls.txt [optional]
 -f,--filename          the name of the output file, default name is urls.txt [optional]
 -l,--level             how deep the crawl should be done, default is 1 [optional]
 -np,--notFollowPath    no url:s on this path will be crawled [optional]
 -p,--followPath        stay on this path when crawling [optional]
 -u,--url               the page that is the startpoint of the crawl, example http://mydomain.com/mypage
 -v,--verify            verify that all links are returning 200, default is set to true [optional]
 -ve,--verbose          verbose logging, default is false [optional]
 -rh,--requestHeaders   the request headers by the form of header1:value1@header2:value2 [optional]
You can also output the result to a CSV file, with the URLs separated into working and non-working:
usage: CrawlToCsv [-f ] [-l ] [-np ] [-p ] -u [-v ]
 -f,--filename          the name of the csv output file, default name is result.csv [optional]
 -l,--level             how deep the crawl should be done, default is 1 [optional]
 -np,--notFollowPath    no url:s on this path will be crawled [optional]
 -p,--followPath        stay on this path when crawling [optional]
 -u,--url               the page that is the startpoint of the crawl, example http://mydomain.com/mypage
 -v,--verify            verify that all links are returning 200, default is set to true [optional]
 -rh,--requestHeaders   the request headers by the form of header1:value1@header2:value2 [optional]
Crawl and output URLs that contain a specific keyword in the HTML:
usage: CrawlToPlainTxtOnlyMatching -k [-l ] [-np ] [-p ] -u [-v ]
 -k,--keyword           the keyword to search for in the page [required]
 -l,--level             how deep the crawl should be done, default is 1 [optional]
 -np,--notFollowPath    no url:s on this path will be crawled [optional]
 -p,--followPath        stay on this path when crawling [optional]
 -u,--url               the page that is the startpoint of the crawl, example http://mydomain.com/mypage
 -v,--verify            verify that all links are returning 200, default is set to true [optional]
 -rh,--requestHeaders   the request headers by the form of header1:value1@header2:value2 [optional]
There is also configuration that you can either set in the crawler.properties file or override with system properties. The defaults are:
## Override these properties by setting a system property
com.soulgalore.crawler.nrofhttpthreads=5
com.soulgalore.crawler.threadsinworkingpool=5
com.soulgalore.crawler.http.socket.timeout=5000
com.soulgalore.crawler.http.connection.timeout=5000
# Auth like:
# soulislove.com:80:username:password,...
com.soulgalore.crawler.auth=
# Proxy properties, if you are behind a proxy.
## The host by this special format: http:proxy.soulgalore.com:80
com.soulgalore.crawler.proxy=
The location of the crawler.properties file can be set with the system property com.soulgalore.crawler.propertydir.
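The "file value unless a system property overrides it" rule can be sketched in a few lines of plain Java. This illustrates the lookup order only and is not the crawler's own configuration code. The class and method names are hypothetical, and the properties are read from a string here so that the example runs without a file on disk.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper, not part of the crawler's API.
final class CrawlerConfig {

    // Parses properties-file text; a real setup would read crawler.properties instead.
    static Properties load(String fileContents) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(fileContents));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new IllegalStateException(e);
        }
        return props;
    }

    // A system property with the same key wins over the value from the file.
    static String resolve(Properties fileProps, String key) {
        String override = System.getProperty(key);
        return override != null ? override : fileProps.getProperty(key);
    }

    public static void main(String[] args) {
        Properties props = load(
            "com.soulgalore.crawler.nrofhttpthreads=5\n"
            + "com.soulgalore.crawler.http.socket.timeout=5000\n");
        System.out.println(resolve(props, "com.soulgalore.crawler.nrofhttpthreads"));
        System.out.println(resolve(props, "com.soulgalore.crawler.http.socket.timeout"));
    }
}
```

Running the sketch with, for example, `-Dcom.soulgalore.crawler.http.socket.timeout=10000` on the `java` command line would make `resolve` return the overridden value for that key.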
Check out the project and compile your own full jar (all dependencies included):
git clone git@github.com:soulgalore/crawler.git
Or add it as a Maven dependency if you want to include the crawler in your project:
<dependency>
  <groupId>com.soulgalore</groupId>
  <artifactId>crawler</artifactId>
  <version>1.5.11</version>
</dependency>
Running from the jar, fetching two levels deep and only fetching URLs that contain "/tagg/":
java -jar crawler-1.5.11-full.jar -u http://soulislove.com -l 2 -p /tagg/
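To make the interaction of `-l` and `-p` more concrete, here is a sketch of a level-limited, breadth-first walk over an in-memory link graph. It is not the crawler's implementation, and it rests on several assumptions: level 1 means the links found on the start page, the path filter is a plain "URL contains the path" test, and the start page itself is exempt from the filter. The class name and the page paths in `main` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not part of the crawler's API.
final class DepthLimitedCrawl {

    // links maps a page URL to the URLs it links to (a stand-in for fetching and parsing HTML).
    // followPath may be null, meaning no path restriction.
    static List<String> crawl(Map<String, List<String>> links, String start,
                              int maxLevel, String followPath) {
        Set<String> seen = new LinkedHashSet<String>();
        seen.add(start); // assumption: the start page is always included
        List<String> frontier = new ArrayList<String>();
        frontier.add(start);
        // Each pass follows the links of the pages discovered in the previous pass.
        for (int level = 0; level < maxLevel && !frontier.isEmpty(); level++) {
            List<String> next = new ArrayList<String>();
            for (String page : frontier) {
                List<String> outgoing = links.get(page);
                if (outgoing == null) {
                    continue;
                }
                for (String link : outgoing) {
                    if (followPath != null && !link.contains(followPath)) {
                        continue; // outside the followed path
                    }
                    if (seen.add(link)) {
                        next.add(link); // first time seen: crawl it on the next level
                    }
                }
            }
            frontier = next;
        }
        return new ArrayList<String>(seen);
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<String, List<String>>();
        List<String> fromStart = new ArrayList<String>();
        fromStart.add("http://soulislove.com/tagg/a");
        fromStart.add("http://soulislove.com/about");
        links.put("http://soulislove.com", fromStart);
        List<String> fromA = new ArrayList<String>();
        fromA.add("http://soulislove.com/tagg/b");
        links.put("http://soulislove.com/tagg/a", fromA);
        System.out.println(crawl(links, "http://soulislove.com", 2, "/tagg/"));
    }
}
```

Tracking visited URLs in a set keeps pages that link to each other from being crawled more than once.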
Running from the jar, adding basic auth:
java -jar -Dcom.soulgalore.crawler.auth=soulgalore.com:80:peter:secret crawler-1.5.11-full.jar -u http://soulislove.com
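The auth value follows the `host:port:username:password,...` pattern shown in the properties comments. As an illustration of that format only, the following sketch splits such a value into its parts. The class is hypothetical and is not how the crawler itself reads the property. It assumes entries are comma-separated and that only the password may contain further colons.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical value class, not part of the crawler's API.
final class AuthEntry {
    final String host;
    final int port;
    final String username;
    final String password;

    AuthEntry(String host, int port, String username, String password) {
        this.host = host;
        this.port = port;
        this.username = username;
        this.password = password;
    }

    // Parses "host:port:username:password,host:port:username:password,...".
    static List<AuthEntry> parseAll(String raw) {
        List<AuthEntry> entries = new ArrayList<AuthEntry>();
        if (raw == null || raw.trim().isEmpty()) {
            return entries; // the default property value is empty: no auth configured
        }
        for (String item : raw.split(",")) {
            // Limit of 4 keeps any extra colons inside the password field.
            String[] parts = item.trim().split(":", 4);
            if (parts.length != 4) {
                throw new IllegalArgumentException(
                    "Expected host:port:username:password but got: " + item);
            }
            // Integer.parseInt throws NumberFormatException for a non-numeric port.
            entries.add(new AuthEntry(parts[0], Integer.parseInt(parts[1]), parts[2], parts[3]));
        }
        return entries;
    }

    public static void main(String[] args) {
        for (AuthEntry e : parseAll("soulgalore.com:80:peter:secret")) {
            System.out.println(e.username + "@" + e.host + ":" + e.port);
        }
    }
}
```

With a comma as the entry separator, a password containing a comma cannot be represented under these assumptions.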
Running from the jar, outputting URLs to a CSV file:
java -cp crawler-1.5.11-full.jar com.soulgalore.crawler.run.CrawlToCsv -u http://soulislove.com
Running from the jar, outputting URLs to two text files, workingurls.txt and nonworkingurls.txt:
java -cp crawler-1.5.11-full.jar com.soulgalore.crawler.run.CrawlToFile -u http://soulislove.com -f workingurls.txt -ef nonworkingurls.txt
Running from the jar, verifying that assets are OK:
java -cp crawler-1.5.11-full.jar com.soulgalore.crawler.run.CrawlAndVerifyAssets -u http://www.peterhedenskog.com
Copyright 2014 Peter Hedenskog
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.