Concurrent-Word-Distribution-Tool

A JavaFX desktop application for concurrently computing the distribution of specified words in large directories/files and plotting the results on a graph. (Distribution: the number of times each word appears in a text file.)

Overview

The application functions as a pipeline made up of several types of components working concurrently.
There are three types of components:

  1. Input component - data entry point
  2. Cruncher component - data processing
  3. Output component - storing and visualizing results

The user can create multiple instances of each component and link (connect) them as they see fit.
Every component instance runs in its own thread and every component type has a dedicated thread pool.
Input components provide input to cruncher components, which provide input to the output components.
Component communication (data flow) is based on shared blocking queues.
The architecture of the system makes it easy to integrate new types of components.
The application is optimized to use as little RAM as possible.
Components and the main app follow the MVC design pattern.
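
As a rough illustration of that threading model (the class and pool names below are illustrative, not the project's actual identifiers):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ForkJoinPool;

    // Illustrative only: one dedicated pool per component type for its main tasks,
    // while every component instance runs its own control loop on a separate thread.
    class ComponentPools {
        static final ExecutorService INPUT_POOL    = Executors.newCachedThreadPool();
        static final ForkJoinPool    CRUNCHER_POOL = new ForkJoinPool();
        static final ExecutorService OUTPUT_POOL   = Executors.newCachedThreadPool();
    }

    // Starting a new component instance on its own thread might then look like:
    //   new Thread(someInputComponent).start();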

Component pipeline example:

[Diagram: example component pipeline]


Usage example

[Screenshot: Input 0 linked to Cruncher 0 and the default output component]

Input 0 is linked to Cruncher 0 which is automatically linked to the default output component.
Input 0 is active and currently reading one text file (see the bottom blue label).
Cruncher 0 is currently computing the distribution in three files that Input 0 has provided.
Cruncher progress can also be monitored in the output component: if an item in the list has a * prefix, the results are not ready yet (the cruncher is still working on that file).



[Screenshot: output component showing the word distribution for wiki-7.txt]

In this image the Input and Cruncher components have finished their work from the previous image.
The output component is showing the distribution of words in the file wiki-7.txt.
It is also currently computing the sum distribution that the user specified.



[Screenshot: output component computing a sum (aggregation) of distributions]

In this example the output component is computing the specified distribution sum (aggregation) and is waiting for the final file results to become available, in order to finish the computation.

Component details:

Every component instance runs in its own thread and every component type has a dedicated thread pool for completing its main tasks.
Components communicate among each other using blocking queues. Every component has a blocking queue that its predecessors can write to.
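
A minimal sketch of that queue-based hand-off, assuming illustrative class and method names:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Illustrative sketch: a component owns an input queue that its predecessors
    // write to, and its thread drains that queue in a loop.
    class QueueDrivenComponent implements Runnable {

        private final BlockingQueue<String> inputQueue = new LinkedBlockingQueue<>();

        // Called by a predecessor component to hand over a piece of work.
        public void submit(String text) throws InterruptedException {
            inputQueue.put(text);
        }

        @Override
        public void run() {
            try {
                while (true) {
                    String text = inputQueue.take(); // blocks until work arrives
                    process(text);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // exit cleanly on shutdown
            }
        }

        private void process(String text) {
            // component-specific work (reading, crunching, storing) goes here
        }
    }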

Input components:

Every input component can be linked to one or more cruncher components.
The main objective of input components is to scan directories for text files, which are then read and supplied to the linked crunchers.
The reading of text files is done in a separate task within the input thread pool.
Input components are tied to a disk (drive) that the user specifies when creating a new instance.
Only directories on the specified disk can be scanned, and only one reading task can be active in the thread pool per disk.
After one scan cycle is finished, the component pauses for a certain duration before the next cycle (specified in the config file).
The user can manually pause and resume input components.
The last-modified value of each scanned directory is tracked, so if a directory has been modified, it is scanned again (its text files are read again).
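
A hedged sketch of one scan cycle with the last-modified check described above (the names and the exact filtering logic are assumptions):

    import java.io.File;
    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch: only directories whose lastModified value changed
    // since the previous cycle are scanned (and their text files re-read).
    class DirectoryWatcher {

        private final Map<File, Long> lastSeenModified = new HashMap<>();

        // True if the directory is new or has changed since the last scan cycle.
        boolean needsRescan(File dir) {
            long current = dir.lastModified();
            Long previous = lastSeenModified.put(dir, current);
            return previous == null || previous != current;
        }

        void scanCycle(Iterable<File> directories) {
            for (File dir : directories) {
                if (!needsRescan(dir)) {
                    continue;
                }
                File[] textFiles = dir.listFiles((d, name) -> name.endsWith(".txt"));
                if (textFiles == null) {
                    continue;
                }
                for (File file : textFiles) {
                    // submit a reading task to the input thread pool here
                    // (only one reading task per disk may be active at a time)
                }
            }
            // after the cycle, the component sleeps for file_input_sleep_time ms
        }
    }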

Cruncher components:

In the current implementation, cruncher components are automatically linked to one default output component, but the code supports multiple output components. The main objective of cruncher components is to count the word distribution in the text objects provided by linked input components and to supply the linked output components with the results.
Upon receiving input text, a new RecursiveTask is created within the cruncher thread pool and a Future object is forwarded to all linked output components.
The task recursively creates new tasks, splitting the job (the text) into smaller chunks (chunk size specified in the config file); the distribution is then computed per chunk and the partial results are combined. Every cruncher instance has a specified arity.
If arity = 1, the cruncher counts the number of times every single word appears in a text;
if arity = 3, it counts the number of times every sequence of three consecutive words, in exactly that order, appears in a text, and so on.
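
A rough sketch of the fork/join splitting and the arity-based counting, assuming illustrative names and a simplified chunking strategy (word sequences that cross chunk boundaries are ignored here for brevity):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.RecursiveTask;

    // Illustrative sketch: splits the word array until chunks are small enough,
    // counts sequences of `arity` consecutive words, then merges the partial maps.
    class CountTask extends RecursiveTask<Map<String, Long>> {

        private static final int CHUNK_LIMIT = 100_000; // placeholder; the project's config uses a character-based limit

        private final String[] words;
        private final int from, to, arity;

        CountTask(String[] words, int from, int to, int arity) {
            this.words = words;
            this.from = from;
            this.to = to;
            this.arity = arity;
        }

        @Override
        protected Map<String, Long> compute() {
            if (to - from > CHUNK_LIMIT) {
                int mid = (from + to) / 2;
                CountTask left = new CountTask(words, from, mid, arity);
                CountTask right = new CountTask(words, mid, to, arity);
                left.fork();                                // run the left half asynchronously
                Map<String, Long> result = right.compute(); // compute the right half here
                left.join().forEach((k, v) -> result.merge(k, v, Long::sum)); // combine partial results
                return result;
            }
            Map<String, Long> counts = new HashMap<>();
            for (int i = from; i + arity <= to; i++) {
                StringBuilder key = new StringBuilder(words[i]);
                for (int j = 1; j < arity; j++) {
                    key.append(' ').append(words[i + j]);
                }
                counts.merge(key.toString(), 1L, Long::sum);
            }
            return counts;
        }
    }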

Output components:

Output components store results provided by the linked crunchers.
The results can be aggregated (this is done within the output component thread pool), sorted and plotted on the graph. Output components are aware of all created jobs, even unfinished ones (active jobs have * as a prefix).
The component offers get (blocking) and poll (non-blocking) methods for retrieving results. Single-result plotting uses the poll method and notifies the user if the results are not ready yet.
The aggregation task uses the get method and waits (is blocked) if some results are not ready yet.
All types of results (single or aggregated) are sorted before plotting.
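
A minimal sketch of those two retrieval paths, assuming the per-file results are held as Future objects (the names are illustrative):

    import java.util.Map;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.Future;

    // Illustrative sketch of the two retrieval styles described above.
    class ResultStore {

        // Non-blocking: used by single-result plotting; returns null if the
        // cruncher has not finished yet, so the GUI can notify the user.
        Map<String, Long> poll(Future<Map<String, Long>> job)
                throws InterruptedException, ExecutionException {
            return job.isDone() ? job.get() : null;
        }

        // Blocking: used by the aggregation task, which waits for every operand
        // to become available before combining them.
        Map<String, Long> get(Future<Map<String, Long>> job)
                throws InterruptedException, ExecutionException {
            return job.get();
        }
    }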

System quality:

The application is optimized to use as little RAM as possible, but if RAM does run out, the user is notified and the application shuts down.
GUI buttons, lists, and labels are kept up to date and are enabled only when their actions are applicable.
The user is notified when errors occur with an error message alert.
When exiting the application, new jobs can no longer be started, and all unfinished jobs must finish first (reading a text file, a cruncher working on a file, an output component aggregating, sorting, or plotting results). If unfinished jobs exist, the user is shown a modal dialog stating that the application is in the process of exiting.
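
One way that exit sequence could look, as a hedged sketch (the pool handling is simplified and the names are placeholders):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.TimeUnit;

    // Illustrative exit helper: stop accepting new jobs, then block (behind the
    // modal "exiting..." dialog) until every unfinished job has completed.
    class ShutdownHelper {

        static void shutDownAndAwait(ExecutorService... pools) throws InterruptedException {
            for (ExecutorService pool : pools) {
                pool.shutdown(); // no new jobs can be started
            }
            for (ExecutorService pool : pools) {
                pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS); // wait for running jobs
            }
        }
    }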

Configuration file (app.properties):

Parameters are read during app start and cannot be changed during app operation.

File structure:

file_input_sleep_time=5000 - pause duration for the input component
disks=data/disk1/;data/disk2 - list of disks for the input component
counter_data_limit=10000000 - job limit for counting tasks given in characters
sort_progress_limit=10000 - number of comparisons after which progress bar is updated during sorting
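
A small sketch of how these parameters might be read once at startup with java.util.Properties (the field names are illustrative):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Properties;

    // Illustrative sketch: app.properties is read once at startup and the values
    // are kept immutable for the rest of the application's lifetime.
    class AppConfig {

        final long fileInputSleepTime;
        final String[] disks;
        final long counterDataLimit;
        final int sortProgressLimit;

        AppConfig(String path) throws IOException {
            Properties props = new Properties();
            try (FileInputStream in = new FileInputStream(path)) {
                props.load(in);
            }
            fileInputSleepTime = Long.parseLong(props.getProperty("file_input_sleep_time"));
            disks = props.getProperty("disks").split(";");
            counterDataLimit = Long.parseLong(props.getProperty("counter_data_limit"));
            sortProgressLimit = Integer.parseInt(props.getProperty("sort_progress_limit"));
        }
    }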

Sidenote

This project was an assignment for the course Concurrent and Distributed Systems, taken during the 8th semester at the Faculty of Computer Science in Belgrade. All system functionalities were defined in the assignment specification.

Download

You can download the .jar files here.
