A component-based console application that concurrently crawls directories and scans web pages, counting specified keywords.
The system consists of several components that work concurrently in conjunction. Some components are thread-pool based, while others run in their own separate threads.
The user provides directories and web pages for scanning.
Web page scans continue (in depth) on all newly found links on the current page.
The user can also get or query (poll) different kinds of results (file corpus, URL, domain, summary, etc.).
The system gracefully handles errors and provides feedback to users.
- Web Scanner
- File Scanner
- Result retriever
- Main/CLI (command line interface)
- Job dispatcher
- Directory crawler
There is also a shared blocking queue - Job queue, used for temporarily storing created jobs.
This component recursively scans the directories that the user has provided, looking for text corpora (directories with the prefix "corpus_" that contain text files).
After finding a corpus, a new job is created and submitted to the job queue.
The last-modified value of each corpus directory is tracked, so if a directory has been modified it is scanned again (new jobs are created).
After finishing a scan cycle, the component pauses (for a duration specified in the config file) before starting the next scan.
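The change-detection step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the class name `CrawlerSketch` and the in-memory map are assumptions; only the last-modified comparison mirrors the described behavior.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: tracks each corpus directory's lastModified value so
// a directory is only re-scanned (new jobs created) when it actually changed.
public class CrawlerSketch {
    static final Map<String, Long> lastSeen = new HashMap<>();

    // Returns true if the directory is new or was modified since the last cycle.
    static boolean needsScan(File dir) {
        Long previous = lastSeen.get(dir.getAbsolutePath());
        long current = dir.lastModified();
        lastSeen.put(dir.getAbsolutePath(), current);
        return previous == null || previous != current;
    }

    public static void main(String[] args) {
        File dir = new File(".");
        System.out.println(needsScan(dir)); // first sighting: scan it
        System.out.println(needsScan(dir)); // unchanged since last cycle: skip
    }
}
```

In the real system the crawler would recurse into subdirectories and submit a job per changed corpus; here only the tracking logic is shown.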
Only the directory crawler, CLI and web scanner can write to the Job queue.
Only the job dispatcher component can read from the queue.
This component delegates jobs from the job queue to the appropriate thread-pool component (file/web scanner).
Jobs are submitted as InitiateTasks, which pass Future objects to the result retriever component, which can then poll for results.
The component blocks while the job queue is empty.
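The dispatcher loop described above can be sketched with a `BlockingQueue`, whose `take()` naturally blocks while the queue is empty. The `Job` record and `ScanType` enum are hypothetical names, not the project's actual types.

```java
import java.util.concurrent.*;

// Hedged sketch of the dispatcher: read a job from the shared blocking queue
// and hand it to the matching pool (ForkJoinPool for files, a thread pool for web).
public class DispatcherSketch {
    enum ScanType { FILE, WEB }
    record Job(ScanType type, String target) {}

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Job> jobQueue = new LinkedBlockingQueue<>();
        ExecutorService webPool = Executors.newCachedThreadPool();
        ForkJoinPool filePool = new ForkJoinPool();

        jobQueue.put(new Job(ScanType.FILE, "corpus_demo"));

        // take() blocks the dispatcher thread until a job is available.
        Job job = jobQueue.take();
        switch (job.type()) {
            case FILE -> filePool.submit(() -> System.out.println("file scan: " + job.target()));
            case WEB  -> webPool.submit(() -> System.out.println("web scan: " + job.target()));
        }
        filePool.awaitQuiescence(5, TimeUnit.SECONDS);
        webPool.shutdown();
    }
}
```

In the full system this runs as an endless loop in the dispatcher's own thread; one iteration is shown here.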
The user initiates a new web scanning job by submitting a website url and hop count using the CLI.
After the dispatcher submits a job to the web scanner, web scanning begins.
Every web job task does the following:
- Count the specified keywords on the given website
- If the hop count is greater than 0, start new web scanning jobs for all the links found on the given website (each new job has a decremented hop count). Already scanned URLs are skipped. After a duration specified in the config file, the list of scanned URLs is cleared.
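The hop and deduplication logic above can be sketched as follows. This is an illustration only: `fetchLinks` is a hard-coded placeholder for real HTML link extraction, and the real system would submit new jobs to the job queue rather than recurse directly.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the web task's hop logic: skip visited URLs, and while hops
// remain, follow each discovered link with a decremented hop count.
public class WebTaskSketch {
    static final Set<String> visited = ConcurrentHashMap.newKeySet();

    static void scan(String url, int hops) {
        if (!visited.add(url)) return; // already scanned URLs are skipped
        System.out.println("counting keywords on " + url + " (hops left: " + hops + ")");
        if (hops > 0) {
            for (String link : fetchLinks(url)) {
                scan(link, hops - 1);  // new job with decremented hop count
            }
        }
    }

    // Placeholder for real link extraction from the fetched page.
    static String[] fetchLinks(String url) {
        return url.equals("https://a.example") ? new String[]{"https://b.example"}
                                               : new String[0];
    }

    public static void main(String[] args) {
        scan("https://a.example", 1);
    }
}
```

Clearing `visited` on a timer (url_refresh_time) would allow pages to be re-scanned later.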
After the dispatcher submits a job to the file scanner (ForkJoinPool), the job is divided into smaller chunks.
RecursiveTasks divide the job, count keywords and finally combine the results.
The job is divided until the byte limit (specified in the config file) is satisfied for each task.
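The divide-and-combine step can be sketched with a `RecursiveTask`, as described above. This is a simplified stand-in: it splits an in-memory string rather than files, `SIZE_LIMIT` plays the role of file_scanning_size_limit, and the keyword list is hard-coded.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sketch: a RecursiveTask that splits its input until each chunk is under
// the size limit, counts keywords in the base case, and merges the results.
public class FileScanTask extends RecursiveTask<Map<String, Integer>> {
    static final int SIZE_LIMIT = 16;                // tiny limit for the demo
    static final String[] KEYWORDS = {"one", "two"}; // from the config in the real system
    private final String text;

    FileScanTask(String text) { this.text = text; }

    @Override
    protected Map<String, Integer> compute() {
        if (text.length() <= SIZE_LIMIT) {           // small enough: count directly
            Map<String, Integer> counts = new HashMap<>();
            for (String kw : KEYWORDS) {
                int from = 0, n = 0;
                while ((from = text.indexOf(kw, from)) != -1) { n++; from += kw.length(); }
                counts.put(kw, n);
            }
            return counts;
        }
        // Split at a whitespace boundary near the middle so no keyword is cut in half.
        int mid = text.indexOf(' ', text.length() / 2);
        if (mid == -1) mid = text.length() / 2;
        FileScanTask left = new FileScanTask(text.substring(0, mid));
        FileScanTask right = new FileScanTask(text.substring(mid));
        left.fork();                                  // run left half in parallel
        Map<String, Integer> result = right.compute();
        left.join().forEach((k, v) -> result.merge(k, v, Integer::sum));
        return result;                                // combined counts
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new ForkJoinPool().invoke(
            new FileScanTask("one two one two two one two two"));
        System.out.println(counts.get("one") + " " + counts.get("two")); // prints "3 5"
    }
}
```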
This component fetches results and can perform some simple operations on them.
The user communicates with this component via the CLI.
There are two types of requests:
- Get (blocking command - waits until results are ready)
- Query (non-blocking command - returns results if they are ready, otherwise a "not ready" message is returned)
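The get/query distinction maps naturally onto `Future`, which the dispatcher already hands to the result retriever. A minimal sketch, assuming the results live in stored `Future` objects:

```java
import java.util.Map;
import java.util.concurrent.*;

// Sketch: "query" checks isDone() and returns immediately;
// "get" calls Future.get(), blocking until the result is ready.
public class RetrieverSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<Map<String, Integer>> result = pool.submit(() -> Map.of("one", 3));

        // query: non-blocking peek at readiness
        if (!result.isDone()) System.out.println("Result not ready");

        // get: blocks until the scan finishes, then returns the counts
        System.out.println("get -> " + result.get());
        pool.shutdown();
    }
}
```

Whether the "not ready" branch fires depends on timing; the blocking `get` always returns the finished result.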
The user can ask for results with the following commands:
- get file|directory_name - returns the results of the specified corpus
- query web|url or domain - returns the results (if available) for the specified URL, or the summed results for the specified domain
(When fetching web results for a domain, the result retriever initiates tasks that sum the results of all URLs with that domain name)
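The domain summation can be sketched as a merge of per-URL keyword maps. The hard-coded maps below are placeholders standing in for finished per-URL results; in the real system a dedicated task would gather them.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: sum keyword counts across all URLs belonging to one domain.
public class DomainSumSketch {
    public static void main(String[] args) {
        List<Map<String, Integer>> urlResults = List.of(
            Map.of("one", 2, "two", 1),   // e.g. results for domain.example/a
            Map.of("one", 1, "two", 4));  // e.g. results for domain.example/b

        Map<String, Integer> domainTotal = new HashMap<>();
        for (Map<String, Integer> r : urlResults)
            r.forEach((k, v) -> domainTotal.merge(k, v, Integer::sum));

        System.out.println(domainTotal.get("one") + " " + domainTotal.get("two")); // prints "3 5"
    }
}
```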
The user can also ask for the result summary:
- query file|summary
- get web|summary
Dedicated tasks are created to calculate the summary. (The summary is stored once it is calculated.)
Supported commands:
- ad directory_path - adds the directory to the list of directories that the crawler component searches for text corpora (corpus directories containing text files must have the corpus_ prefix to be found)
- aw url - initiates a web scan of the provided URL (the hop count is taken from the config file)
- get file|corpus_name
- query file|corpus_name
- get web|corpus_url
- query web|corpus_url
- get web|corpus_domain
- query web|corpus_domain
- get file|summary
- query file|summary
- get web|summary
- query web|summary
- cfs - clear file summary
- cws - clear web summary
- stop - exit the application
Parameters are read at application start and cannot be changed while the application is running.
File structure:
keywords=one,two,three - the list of keywords to be counted
file_corpus_prefix=corpus_ - the expected prefix for text corpus directories
dir_crawler_sleep_time=1000 - directory crawler pause duration (ms)
file_scanning_size_limit=1048576 - limit for file scanner tasks given in bytes
hop_count=2 - number of hops the web scanner does (depth)
url_refresh_time=86400000 - interval (ms) after which the list of visited URLs is cleared
This project was an assignment for the course Concurrent and Distributed Systems during the 8th semester at the Faculty of Computer Science in Belgrade. All system functionality was defined in the assignment specification.
You can download the .jar files here.
- Stefan Ginic - stefangwars@gmail.com