About the application:

The application is all about harnessing the capabilities of below components to create a web crawler that can crawl any site.
So here we are mainly interested in getting the pin code of various areas/cities across India from a site called javataskforce.com.

This Application uses:

Apache CXF:
Making REST/HTTP requests to a remote site and fetching the page.
JPA/Hibernate with HSQLDB:
Using JPA with Hibernate to populate the HSQLDB with the information we are interested in.
Apache Camel:
Coordinating between CXF/JPA, leveraging many tasks that require a lot of coding, creating routes that state what to do and when to do.

Overview:

The whole application is interconnected by Camel routes, which are as follows (all routes are defined in camel-routes.xml)

states-invoke-route
- invoke the crawler service to get state
- parse/process the states obtained
- put them in another Queue
- store the states into HSQLDB
districts-invoke-route
- pool the DB (HsqlDb) and get the State with some particular status (INITIAL)
- update the status of State to (IN_PROCESS)
- invoke the crawler service to obtain districts for state
- parse/process the districts obtained
- put them in another Queue
- store the districts into HSQLDB
- mark the state as processed (COMPLETED)
pincode-invoke-route
- pool the DB (HsqlDb) and get the District with some status (INITIAL)
- update the status of District to (IN_PROCESS)
- invoke the crawler service to obtain areas/cities for the fetched District
- parse/process the areas/cities obtained
- put them in another Queue
- store the area/city into HSQLDB
- mark the district as processed (COMPLETED)

In a nutshell, there are some routes that invoke a crawl URL, and the result is parsed/processed and then saved into the crawl Database.

Configuring the application:

Scheduling configuration:

Open camel-routes.xml and change the cron statement to a suitable time: quartz://myGroup/myTimerName?cron=30 06 00 03 08 2016
Open the camel-routing-policy.xml and schedule time for:
districts-invoke-route
pincode-route-policy

Running the application:

Run the following commands: mvn clean install
Run the HSQL Database.
Go to the hsqldb folder:

On Windows:
- Click db_start.bat.
On Unix/Linux:
- Execute sh db_start.sh.

Execute the following command to run the application:

Monitoring:

The application can be monitored using JConsole at port 86764 (the port can be changed by changing the JMX setting in pom.xml):

-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.port=8764

Open JConsole and add the following connection string:

service:jmx:rmi:///jndi/rmi://127.0.0.1:8764/jmxrmi

Monitoring HSQL DB:
Windows Client:
- Execute db_client.bat available in the hsqldb folder.

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
.settings		.settings
hsqldb		hsqldb
src/main		src/main
.classpath		.classpath
.project		.project
README.md		README.md
ReadMe.txt		ReadMe.txt
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About the application:

This Application uses:

Overview:

Configuring the application:

Running the application:

Monitoring:

About

Releases

Packages

Contributors 2

Languages

santoshjoshi/web-crawler

Folders and files

Latest commit

History

Repository files navigation

About the application:

This Application uses:

Overview:

Configuring the application:

Running the application:

Monitoring:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages