A component-based console application that concurrently crawls directories and scans web pages, counting specified keywords.
The system consists of several components that work concurrently in conjunction. Some components are thread-pool based, while others run in their own separate threads.
The user provides directories and web pages for scanning.
Web page scans continue (in depth) on all newly found links on the current page.
The user can also get or query (poll) different kinds of results (file corpus, URL, domain, summary, etc.).
The system gracefully handles errors and provides feedback to users.
- Web Scanner
- File Scanner
- Result retriever
- Main/CLI (command line interface)
- Job dispatcher
- Directory crawler
There is also a shared blocking queue - Job queue, used for temporarily storing created jobs.
This component recursively scans the directories that the user has provided, looking for text corpora (directories with the prefix "corpus_" that contain text files).
After finding a corpus, a new job is created and submitted to the job queue.
The last-modified value of each corpus directory is tracked, so if a directory has been modified it is scanned again (new jobs are created).
After finishing a scan cycle, the component pauses (for a duration specified in the config file) before starting the next scan.
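The change-detection step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the class name `CrawlerSketch` and the in-memory map are assumptions; only the last-modified comparison mirrors the described behavior.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: tracks each corpus directory's lastModified value so
// a directory is only re-scanned (new jobs created) when it actually changed.
public class CrawlerSketch {
    static final Map<String, Long> lastSeen = new HashMap<>();

    // Returns true if the directory is new or was modified since the last cycle.
    static boolean needsScan(File dir) {
        Long previous = lastSeen.get(dir.getAbsolutePath());
        long current = dir.lastModified();
        lastSeen.put(dir.getAbsolutePath(), current);
        return previous == null || previous != current;
    }

    public static void main(String[] args) {
        File dir = new File(".");
        System.out.println(needsScan(dir)); // first sighting: scan it
        System.out.println(needsScan(dir)); // unchanged since last cycle: skip
    }
}
```

In the real system the crawler would recurse into subdirectories and submit a job per changed corpus; here only the tracking logic is shown.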
Only the directory crawler, CLI and web scanner can write to the Job queue.
Only the job dispatcher component can read from the queue.
This component delegates jobs from the job queue to the appropriate thread-pool component (file/web scanner).
Jobs are submitted as InitiateTasks, which pass Future objects to the result retriever component, which can then poll for results.
The component blocks while the job queue is empty.
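The dispatcher loop described above can be sketched with a `BlockingQueue`, whose `take()` naturally blocks while the queue is empty. The `Job` record and `ScanType` enum are hypothetical names, not the project's actual types.

```java
import java.util.concurrent.*;

// Hedged sketch of the dispatcher: read a job from the shared blocking queue
// and hand it to the matching pool (ForkJoinPool for files, a thread pool for web).
public class DispatcherSketch {
    enum ScanType { FILE, WEB }
    record Job(ScanType type, String target) {}

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Job> jobQueue = new LinkedBlockingQueue<>();
        ExecutorService webPool = Executors.newCachedThreadPool();
        ForkJoinPool filePool = new ForkJoinPool();

        jobQueue.put(new Job(ScanType.FILE, "corpus_demo"));

        // take() blocks the dispatcher thread until a job is available.
        Job job = jobQueue.take();
        switch (job.type()) {
            case FILE -> filePool.submit(() -> System.out.println("file scan: " + job.target()));
            case WEB  -> webPool.submit(() -> System.out.println("web scan: " + job.target()));
        }
        filePool.awaitQuiescence(5, TimeUnit.SECONDS);
        webPool.shutdown();
    }
}
```

In the full system this runs as an endless loop in the dispatcher's own thread; one iteration is shown here.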
The user initiates a new web scanning job by submitting a website url and hop count using the CLI.
After the dispatcher submits a job to the web scanner, web scanning begins.
Every web job task does the following:
- Count the specified keywords on the given website
- If the hop count is greater than 0, start new web scanning jobs for all the links found on the given website (each new job has a decremented hop count). Already scanned URLs are skipped. After a duration specified in the config file, the list of scanned URLs is cleared.
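The hop and deduplication logic above can be sketched as follows. This is an illustration only: `fetchLinks` is a hard-coded placeholder for real HTML link extraction, and the real system would submit new jobs to the job queue rather than recurse directly.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the web task's hop logic: skip visited URLs, and while hops
// remain, follow each discovered link with a decremented hop count.
public class WebTaskSketch {
    static final Set<String> visited = ConcurrentHashMap.newKeySet();

    static void scan(String url, int hops) {
        if (!visited.add(url)) return; // already scanned URLs are skipped
        System.out.println("counting keywords on " + url + " (hops left: " + hops + ")");
        if (hops > 0) {
            for (String link : fetchLinks(url)) {
                scan(link, hops - 1);  // new job with decremented hop count
            }
        }
    }

    // Placeholder for real link extraction from the fetched page.
    static String[] fetchLinks(String url) {
        return url.equals("https://a.example") ? new String[]{"https://b.example"}
                                               : new String[0];
    }

    public static void main(String[] args) {
        scan("https://a.example", 1);
    }
}
```

Clearing `visited` on a timer (url_refresh_time) would allow pages to be re-scanned later.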
After the dispatcher submits a job to the file scanner (ForkJoinPool), the job is divided into smaller chunks.
RecursiveTasks divide the job, count keywords and finally combine the results.
The job is divided until the byte limit (specified in the config file) is satisfied for each task.
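The divide-and-combine step can be sketched with a `RecursiveTask`, as described above. This is a simplified stand-in: it splits an in-memory string rather than files, `SIZE_LIMIT` plays the role of file_scanning_size_limit, and the keyword list is hard-coded.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sketch: a RecursiveTask that splits its input until each chunk is under
// the size limit, counts keywords in the base case, and merges the results.
public class FileScanTask extends RecursiveTask<Map<String, Integer>> {
    static final int SIZE_LIMIT = 16;                // tiny limit for the demo
    static final String[] KEYWORDS = {"one", "two"}; // from the config in the real system
    private final String text;

    FileScanTask(String text) { this.text = text; }

    @Override
    protected Map<String, Integer> compute() {
        if (text.length() <= SIZE_LIMIT) {           // small enough: count directly
            Map<String, Integer> counts = new HashMap<>();
            for (String kw : KEYWORDS) {
                int from = 0, n = 0;
                while ((from = text.indexOf(kw, from)) != -1) { n++; from += kw.length(); }
                counts.put(kw, n);
            }
            return counts;
        }
        // Split at a whitespace boundary near the middle so no keyword is cut in half.
        int mid = text.indexOf(' ', text.length() / 2);
        if (mid == -1) mid = text.length() / 2;
        FileScanTask left = new FileScanTask(text.substring(0, mid));
        FileScanTask right = new FileScanTask(text.substring(mid));
        left.fork();                                  // run left half in parallel
        Map<String, Integer> result = right.compute();
        left.join().forEach((k, v) -> result.merge(k, v, Integer::sum));
        return result;                                // combined counts
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new ForkJoinPool().invoke(
            new FileScanTask("one two one two two one two two"));
        System.out.println(counts.get("one") + " " + counts.get("two")); // prints "3 5"
    }
}
```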
This component fetches results and can perform some simple operations on them.
The user communicates with this component via the CLI.
There are two types of requests:
- Get (blocking command - waits until results are ready)
- Query (non-blocking command - returns results if they are ready, otherwise a "not ready" message is returned)
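The get/query distinction maps naturally onto `Future`, which the dispatcher already hands to the result retriever. A minimal sketch, assuming the results live in stored `Future` objects:

```java
import java.util.Map;
import java.util.concurrent.*;

// Sketch: "query" checks isDone() and returns immediately;
// "get" calls Future.get(), blocking until the result is ready.
public class RetrieverSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<Map<String, Integer>> result = pool.submit(() -> Map.of("one", 3));

        // query: non-blocking peek at readiness
        if (!result.isDone()) System.out.println("Result not ready");

        // get: blocks until the scan finishes, then returns the counts
        System.out.println("get -> " + result.get());
        pool.shutdown();
    }
}
```

Whether the "not ready" branch fires depends on timing; the blocking `get` always returns the finished result.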
The user can ask for results with the following commands:
- get file|directory_name - returns the results of the specified corpus
- query web|url or domain - returns the results (if available) for the specified URL, or the summed results for the specified domain
(When fetching web results for a domain, the result retriever initiates tasks that sum the results of all URLs with that domain name)
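The domain summation can be sketched as a merge of per-URL keyword maps. The hard-coded maps below are placeholders standing in for finished per-URL results; in the real system a dedicated task would gather them.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: sum keyword counts across all URLs belonging to one domain.
public class DomainSumSketch {
    public static void main(String[] args) {
        List<Map<String, Integer>> urlResults = List.of(
            Map.of("one", 2, "two", 1),   // e.g. results for domain.example/a
            Map.of("one", 1, "two", 4));  // e.g. results for domain.example/b

        Map<String, Integer> domainTotal = new HashMap<>();
        for (Map<String, Integer> r : urlResults)
            r.forEach((k, v) -> domainTotal.merge(k, v, Integer::sum));

        System.out.println(domainTotal.get("one") + " " + domainTotal.get("two")); // prints "3 5"
    }
}
```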
The user can also ask for the result summary:
- query file|summary
- get web|summary
Dedicated tasks are created to calculate the summary. (The summary is stored once it is calculated.)
Supported commands:
- ad directory_path - adds the directory to the list of directories that the crawler component searches for text corpora (corpus directories containing text files must have the corpus_ prefix to be found)
- aw url - initiates a web scan of the provided URL (the hop count is taken from the config file)
- get file|corpus_name
- query file|corpus_name
- get web|corpus_url
- query web|corpus_url
- get web|corpus_domain
- query web|corpus_domain
- get file|summary
- query file|summary
- get web|summary
- query web|summary
- cfs - clear file summary
- cws - clear web summary
- stop - exit the application
Parameters are read at application start and cannot be changed while the application is running.
File structure:
keywords=one,two,three - the list of keywords to be counted
file_corpus_prefix=corpus_ - the expected prefix for text corpus directories
dir_crawler_sleep_time=1000 - directory crawler pause duration (ms)
file_scanning_size_limit=1048576 - limit for file scanner tasks given in bytes
hop_count=2 - number of hops the web scanner does (depth)
url_refresh_time=86400000 - interval (ms) after which the list of visited URLs is cleared
This project was an assignment for the course Concurrent and Distributed Systems during the 8th semester at the Faculty of Computer Science in Belgrade. All system functionality was defined in the assignment specification.
You can download the .jar files here.
- Stefan Ginic - stefangwars@gmail.com