
Add larger architectural aspirations.

1 parent 9e5ca72 commit be675899a78ec9cd0d1946b86768fdd932beab58 @tdunning committed Oct 9, 2012
Showing with 13 additions and 6 deletions.
  1. +13 −6 README.md
@@ -1,22 +1,27 @@
mapr-spout
==========
-A Storm spout that tails a file (or collection of files)
+A Storm spout that tails a file (or collection of files).
The basic idea is that you supply a parser for events that can read from an input stream and this spout will read from files whenever new data gets added. The cool thing is that if this spout gets killed and restarted, it will know how to deal with it correctly and will pick up where it left off.
-The way that this works is that each time nextTuple() is called, the current file is parsed until we hit a limit
-of the number of tuples to read in one call or the end of the file. If we hit the end of the file, then we move to the
-next file in the directory of interest.
+The way that this works is that there are many objects, each of which is observing a single directory. Each time nextTuple() is called, one of the directory observers is asked to check its directory. The current file in that directory is parsed until we hit a limit of the number of tuples to read in one call or the end of the file. If we hit the end of the file, then we move to the next file in the directory of interest. When we hit the limit on the number of files to parse from a single directory, that observer is moved to the end of the queue of observers. This allows many messages to be pulled from a single directory while catching up but ultimately provides fair sharing between all live directories. This strategy also decreases the overhead of checking files for modifications in an overload situation.
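The rotation described above can be sketched in plain Java. This is a hypothetical illustration, not the spout's actual API: the class and method names (`DirectoryObserver`, `checkDirectory`, `nextBatch`) are assumptions, and the real observer would read tuples from files rather than fabricate them.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the observer queue: each nextTuple() call polls
// one observer, then rotates it to the tail of the queue for fair sharing.
public class ObserverQueueSketch {
    static class DirectoryObserver {
        final String dir;
        DirectoryObserver(String dir) { this.dir = dir; }
        // Stand-in for parsing the directory's current file: in the real
        // spout this would read up to `limit` tuples or hit end-of-file.
        List<String> checkDirectory(int limit) {
            List<String> tuples = new ArrayList<>();
            for (int i = 0; i < limit; i++) tuples.add(dir + "/tuple");
            return tuples;
        }
    }

    final Deque<DirectoryObserver> observers = new ArrayDeque<>();

    // Poll the observer at the head, then move it to the tail so every
    // live directory eventually gets a turn.
    List<String> nextBatch(int limit) {
        DirectoryObserver head = observers.pollFirst();
        if (head == null) return List.of();
        List<String> tuples = head.checkDirectory(limit);
        observers.addLast(head);
        return tuples;
    }
}
```

Rotating only after the per-call limit is hit is what lets a backlogged directory drain many messages per turn while still sharing fairly once everything is caught up.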
+
+Every so often, the directory trees below the known roots are scanned to see if there are new directories that need observers. If there are, then the appropriate new observers are created and inserted at the head of the queue so that they will catch up quickly. This means that there can be a substantial delay before any messages are processed from a new directory (say 30s or so), but this will minimize the cost of scanning for new directories.
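The rescan-and-insert-at-head behavior can be sketched as follows; the names here are illustrative assumptions, not mapr-spout classes, and the real scan would walk the filesystem under the known roots.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the periodic rescan: any directory not yet
// watched gets inserted at the HEAD of the queue so it is polled first
// and catches up quickly.
public class RescanSketch {
    final Deque<String> queue = new ArrayDeque<>(); // directories with observers
    final Set<String> known = new HashSet<>();

    void rescan(List<String> dirsUnderRoots) {
        for (String dir : dirsUnderRoots) {
            if (known.add(dir)) {
                queue.addFirst(dir);  // new directory jumps the queue
            }
        }
    }
}
```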
When running in reliable mode, tuples are held in memory until they are acknowledged.
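A minimal sketch of that reliable-mode bookkeeping, assuming a pending map keyed by message id (purely illustrative; the actual SpoutState class may track this differently):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative reliable-mode bookkeeping: emitted tuples stay in a
// pending map until ack(); fail() hands the tuple back for replay.
public class PendingAckSketch {
    final Map<Long, String> pending = new HashMap<>();
    long nextId = 0;

    long emit(String tuple) {
        long id = nextId++;
        pending.put(id, tuple);  // held in memory until acknowledged
        return id;
    }

    void ack(long id) { pending.remove(id); }

    String fail(long id) { return pending.remove(id); }  // caller re-emits
}
```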
-For right now, there is no driver for the spout. All that is in place and tested are the DirectoryScanner and the
-SpoutState classes.
+For right now, there is no driver for the spout. All that is in place and tested is the DirectoryScanner and the SpoutState classes.
Data Collection Architecture
==========
+Data collection consists of appending to files and rolling to new files for existing directories. When messages are received that should be stored in new directories, those directories should be created and then business as usual should proceed.
+
+For convenience, it is nice to emulate APIs like that of Kafka. In such an emulation, it is probably good to have a single directory per topic, and topics should be assigned to particular API end-points. The API servers should monitor Zookeeper to see when the collection of API servers changes.
+
+Messages that arrive at the wrong API server for a particular topic should be forwarded to the correct server, and the reply to the client should contain the correct API server to use. This will allow clients to rapidly adapt to changes in the API serving farm without having to access Zookeeper directly. It will also allow clients to be robust to network misconfigurations that prevent them from contacting all of the API servers.
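One way to picture the forward-and-redirect behavior is the sketch below. Everything here is an assumption for illustration: in a real deployment the topic-to-server map would come from Zookeeper, and the receiving server would actually forward the message rather than just note the owner.

```java
import java.util.Map;

// Illustrative redirect protocol: a server that does not own a topic
// still accepts the message (forwarding it to the owner) but tells the
// client which server to use for future sends.
public class RedirectSketch {
    static class Reply {
        final boolean accepted;
        final String useServer;  // correct end-point for future sends
        Reply(boolean accepted, String useServer) {
            this.accepted = accepted;
            this.useServer = useServer;
        }
    }

    final String self;
    final Map<String, String> topicOwners;  // topic -> owning server

    RedirectSketch(String self, Map<String, String> topicOwners) {
        this.self = self;
        this.topicOwners = topicOwners;
    }

    Reply receive(String topic, String message) {
        String owner = topicOwners.getOrDefault(topic, self);
        // A real server would forward `message` to `owner` here when
        // owner != self; the client just learns where to go next time.
        return new Reply(true, owner);
    }
}
```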
+
- data collection will be via an API similar to (if not identical to) Kafka's. The client should be able to interrogate Zookeeper to find the current live data collector
- there will be a fail-over data collector process that appends to the data files
@@ -31,6 +36,8 @@ Data Collection Architecture
Missing bits
==========
+The queueing of directory observers is missing. So are the data collection API and servers.
+
I haven't written any parsers yet. Such an exercise might expose some interesting problems.
Likewise, we should have a couple of different strategies to handle the situation when there are lots of pending
