added docs

1 parent ea1a4fd commit 1c5b931cd0869a06f5d20e43d3029dd0ced0ea57 @sitaramc committed Jan 22, 2012
Showing with 404 additions and 0 deletions.
  1. +4 −0 README.mkd
  2. +212 −0 map-vs-gp.mkd
  3. +188 −0 map.mkd
4 README.mkd
@@ -0,0 +1,4 @@
+# map -- making xargs simpler and more powerful
+
+Pre-rendered online documentation is
+[here](http://sitaramc.github.com/map/map.html).
212 map-vs-gp.mkd
@@ -0,0 +1,212 @@
+# F=mapgp map versus GNU parallel
+
+Here's a feature comparison of map versus GNU parallel, mostly using examples
+in their [man page](http://www.gnu.org/software/parallel/man.html).
+
+Before we get started, note that `map` has an additional feature that GNU
+parallel doesn't have (and maybe was never meant to have, but I think it is
+very relevant); see "delimiter mode" in the main documentation.
+
+I got tired of trying these examples at about half way through their list.
+The examples were getting more and more "kitchen sink"-ish (I mean, an option
+called '--return' to get back a file from a remote computer? I know "GNU is
+Not Unix" but this is too much!)
+
+It is also clear to me that for most common uses, something with only 3
+options, which seems to match almost all the examples up to the halfway point
+of a large set of them, is a neat thing to have.
+
+----
+
+(Note: we're omitting the examples that talk about spaces in filenames, using
+the null character, etc., because I didn't have the energy to test them
+rigorously).
+
+First we create test data:
+
+ seq 1 10 | map -p 3 dd if=/dev/urandom bs=1M count=20 of=test-%.in
+
+That creates 10 files, 3 at a time.
+
+**Example: parallel gzip**
+
+ ls *.in | map -p 5 gzip
+ # default % added at the end due to '-p'
+
+Like the GNU parallel original, this runs 5 jobs at the same time, each with
+one argument. There are other ways to do the same thing:
+
+**Example: reading arguments from command line**
+
+No special syntax is required for this. If STDIN is a tty, map's idea of what
+is the command and what are its arguments changes:
+
+ map -p 3 -n 3 gzip *.in
+ # default back to %% due to '-n'
+
+This will pick up 3 arguments per job, 3 jobs in parallel.
+
+**Example: Inserting multiple arguments**
+
+As you saw in the "%" and "%%" comments above, inserting multiple arguments is
+just using %% instead of %:
+
+ ls *.in | map mv %% DESTDIR
+
+**Example: context replace**
+
+Again, we don't need all this extra syntax, so (using echo instead of rm for
+testing):
+
+ seq 1 10 | map echo test-%.in
+
+which runs 10 individual jobs like their first example, and
+
+ seq 1 10 | map echo test-%%.in
+
+will run in one shot, and if you changed that 10 to 5000 or something it will
+separate them when the command length has grown beyond 60,000 characters.
+
+**Compute intensive jobs and substitution**
+
+Keeping the basic structure of their commands, we'll do this.
+
+First we take out 10 test data files and move some of them into
+subdirectories:
+
+ mkdir aa bb bb/cc
+ mv *10* aa; mv *5* bb; mv *2* *4* bb/cc
+
+Now the find example:
+
+ find . -name '*.in' | map -p dd if=% of=%D/%B_swabbed.in conv=swab
+
+As you can see, we use %D etc instead of their `{.}`. In fact, our `%` is
+always exactly the same as `%D/%B.%E` (directory, basename, extension).
+
+**Substitution and redirection**
+
+First I gzip the existing files to provide the input to this step, then:
+
+ map -p "zcat % > %B" *.gz
+
+**Example: composed commands**
+
+ ls | map -- 'echo -ne %"\t"; ls %|wc -l'
+
+ ls | map -- '( echo -ne %"\t"; ls %|wc -l ) > %.dir'
+
+I didn't do the next 2; they seemed boring and not really clean. The URLs one
+was nice, but why waste all that bandwidth (and also needlessly write the damn
+files to disk)?
+
+ cat urls |map -p 'HEAD % &>/dev/null || grep -n % urls'
+
+I skipped the mirror files one, but here's the one about files in a list that
+do not exist:
+
+ cat files | map -p '[ -e % ] || echo %'
+
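For comparison, the same "which listed files are missing" check can be written as a plain sequential shell loop, with no `map` involved. This is only a sketch: the helper name `list_missing`, the file names, and the temporary directory are made up for the demonstration, and it tests one name at a time rather than in parallel.

```shell
# Hypothetical demo data: a list naming one file that exists and one that does not.
workdir=$(mktemp -d)
touch "$workdir/present.txt"
printf '%s\n' "$workdir/present.txt" "$workdir/absent.txt" > "$workdir/files"

# Print every name in the list file ($1) that does not exist on disk.
list_missing() {
    while IFS= read -r f; do
        [ -e "$f" ] || printf '%s\n' "$f"
    done < "$1"
}

missing=$(list_missing "$workdir/files")
echo "$missing"
```

The loop body is the same `[ -e ... ] || echo ...` test as the `map` one-liner; what `map -p` adds is running those tests as parallel jobs.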
+**Example: removing file extension...**
+
+ ls *.zip | map -p 'mkdir %B; cd %B; unzip -q ../%'
+
+ map -p 'zcat % |bzip2 > %B.bz2 && rm %' *.gz
+
+**Example: remove 2 file extensions, calling map from itself**
+
+ ls -d *.tar.gz | map echo %B | map -v 'mkdir %B; tar -xf %.gz'
+
+**Example: download 10 images for past 30 days**
+
+It's basically doing a loop within a loop. Seems like a built-in operation at
+some level in GNU parallel but we can do it too. We'll do it with an 'echo'
+instead of 'wget':
+
+ seq 30 | map 'seq -w 10 | map -n 1 -- echo today -%, picture'
+
+The first seq puts out a number that replaces the sole '%' sign in the whole
+map, while the second has an implicit '%%' at the end, limited to one argument
+per run by the '-n 1'.
+
+**Example: Rewriting a for-loop and a while...**
+
+Pretty trivial; there are many examples earlier. In the vein of their
+example:
+
+ cat list | map -p 'do1 % scale %B.jpg; do2 < % %B'
+
+**Example: Rewriting nested-for loops**
+
+See earlier example "download 10 images..." -- you can use the same solution
+here. As I said there, GNU parallel seems to have special syntax for this; we
+actually pipe a map to another one.
+
+**Example: Group output lines**
+
+We can't do this. Without explicit redirection of some kind we may never be
+able to. This is because we get our parallelism using xargs.
+
+(Also the next 2 examples)
+
+**Example: parallel grep**
+
+Easy enough to do; but it doesn't make sense to me to parallelise something
+that will be IO-bound anyway; in all my tests it comes out slower to do this.
+
+**Example: using remote computers**
+
+...nope, we can't do it!
+
+**Example: run the same command 10 times**
+
+Force it to go through xargs and put in a comment character after the command
+you want:
+
+ seq 10 | map -p 1 cmd \# boo
+
+**Example: working as cat|sh**
+
+ cat cmdlist | map -p 100 -- %
+
+## #cantwont things that map cannot/will not do
+
+(this is only from the examples list; I haven't read the full options list,
+reasoning that if an option were indispensable they'd have an example for it
+anyway)
+
+ * see note above on filenames with unusual characters. The big thing you
+ lose is parallelism when dealing with such.
+
+ * map can't do "group output" when running in parallel. You can use some
+ tricks if you really need it, like:
+
+ ... | map -p 4 'some-cmd % 2>&1 | sed -e "s/^/$$:/"' | sort | cut -f2- -d:
+
+ I don't need this feature enough to do more than that.
+
+ Besides, according to their manpage this takes a lot of CPU (why???),
+ compared to not grouping the output. My workaround clearly doesn't --
+ unless the output is too big for sort I suppose.
+
+ * same for "tagging output lines", and "keeping order ... the same ...",
+ although the latter is also achieved by the previous workaround.
+
+ * map can't handle multiple inputs in one command. However, their example
+ is easy enough with map's "delimiter mode":
+
+ ls *.tar.* | perl -ne 'chomp; print; s/\.tar//; print " $_\n"' | map -d -- cp %1 %2
+
+ As you can see, there was no need to put the inputs into separate files
+ anyway. If I find a genuine need I'll think about it...
+
+ * map can't spread STDIN breadthwise among the available jobs, nor split the
+ data into chunks to spread.
+
+ * **NO KITCHEN SINK**
+
+ * no '--sshlogin' to login to remote computers
+ * no '--transfer' to transfer files
+ * *definitely* no '--return' to get those files back
+ * no '--cleanup'
+ * *definitely* no '--trc' as shorthand for previous 3, heh!
+ * and last but not least, we're not a 'semaphore' program!
188 map.mkd
@@ -0,0 +1,188 @@
+# map -- making xargs simpler *and* more powerful at the same time!
+
+[[TOC]]
+
+----
+
+The `map` command was something I wrote a long time ago and have used pretty
+much forever, in a sort of "taken for granted" way. I would never have put it
+out there if I had not, by chance, discovered something called [GNU
+Parallel][gp] and started reading the **huge** list of examples on its pages.
+
+And yet, casually looking at the examples, I found that `map` could do pretty
+much all of the generic ones! So much so that I sat down and started writing
+down `map` equivalents of GNU Parallel's examples, and before I knew it I was
+about half way through their list with only a few that `map` could not do!
+The end result was this [feature comparison][mapgp].
+
+But...
+
+ * `map` is 330 lines of perl. GNU Parallel is 5000 lines.
+ * `map` has 3 options (if you don't count -h, -q, and -v). GNU Parallel has
+ almost 100.
+
+(In all fairness, [here][cantwont]'s a list of things `map` can't/won't do
+which GNU Parallel can/will, although a lot of them are "kitchen sink" items!)
+
+And that was when I decided to put this out there as its own little project.
+
+If you use it, please let me know. Some quick documentation is right here in
+this file. Examples are [here][mapgp]. `map` responds to `-h` as you would
+expect, if you need to refresh your memory.
+
+[gp]: http://www.gnu.org/software/parallel/
+
+----
+
+## concepts
+
+'map' is like xargs in many ways, except for having very few options, and a
+fixed set of "replace strings", all using the `%` character.
+
+The **biggest difference**, conceptually, is that map will treat each input
+line as one single argument and does not space-separate them again. This
+makes it usable by default for filenames with spaces etc (although please see
+the IMPORTANT NOTES section later for details).
+
+The second difference is that `map` will happily treat the first argument as
+the command and the rest as input "lines" if STDIN is a tty, so you can say
+things like:
+
+ map gzip *.pdf
+
+or
+
+ map "zip -q -r % %" src doc conf contrib hooks
+
+The third difference is that map also has a pretty cool "delimiter
+mode" that at first seems totally unrelated to xargs but actually is not.
+There are a couple of examples later.
+
+## default replacement string
+
+For quick reference, here are the defaults if no '%' sign exists anywhere in
+the command. See the next section(s) for what '%' and '%%' mean.
+
+Normally a '%%' is assumed to be appended to the end. If the `-p` option is
+used without the `-n` option, then this changes to a single '%'.
+
+## single replacements
+
+`%` is replaced by the current input line, with a trailing slash removed if
+present. `%D` is replaced by the directory name of the current filename.
+`%B` is the basename and `%E` is the extension. (This means that `%` is
+pretty much equal to `%D/%B.%E`).
+
+These replacements use only one input line per run, so
+
+ seq 1 3 | map echo %
+
+gives you
+
+ 1
+ 2
+ 3
+
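The directory / basename / extension split behind `%D`, `%B` and `%E` can be sketched in plain shell. `split_path` is a hypothetical helper written for this illustration, not part of `map`; how `map` itself treats names with several dots, or with no extension at all, is not documented here, so the handling of those cases below is an assumption.

```shell
# Sketch only: split a path the way %D, %B and %E are described above.
# split_path is a hypothetical helper, not part of map.
split_path() {
    path=${1%/}                      # drop one trailing slash, as % does
    dir=$(dirname -- "$path")        # corresponds to %D
    file=$(basename -- "$path")
    case $file in
        *.*) base=${file%.*}; ext=${file##*.} ;;   # %B and %E
        *)   base=$file; ext= ;;                   # assumption: no dot, empty extension
    esac
    printf '%s|%s|%s\n' "$dir" "$base" "$ext"
}

split_path ./bb/cc/test-2.in     # prints ./bb/cc|test-2|in
```

Joining the three parts back together as `dir/base.ext` reproduces the original path, which is the sense in which `%` matches `%D/%B.%E`.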
+## multiple replacements
+
+However, most often, you want all the arguments tacked on to one "run" of the
+command. Do this by specifying a `%%`:
+
+ seq 1 3 | map echo %%
+
+returns
+
+ 1 2 3
+
+Since this is the most common reason for using map, this is the default if you
+don't specify either % or %%:
+
+ seq 1 3 | map echo
+ # returns:
+ 1 2 3
+
+A `%%` (and similarly `%%D`, `%%B`, and `%%E`) gets replaced by as many input
+lines as possible (subject to the internal limit on command line length and the
+user-specified `-n` value if used).
+
+Just like GNU Parallel, this replacement even works within a word, replicating
+the entire word:
+
+ seq 1 3 | map echo abc-%%-def
+
+produces
+
+ abc-1-def abc-2-def abc-3-def
+
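For readers who think in xargs terms, here is roughly what the `%` and `%%` examples correspond to. These are plain `xargs` and `sed` commands, not `map` itself, and are meant only as a point of comparison.

```shell
# One run per input line, like `map echo %`:
per_line=$(seq 1 3 | xargs -n 1 echo)

# All input lines on a single run, like `map echo %%`:
all_at_once=$(seq 1 3 | xargs echo)

# Replicating a whole word per input line, like `map echo abc-%%-def`;
# xargs has no direct equivalent, so sed builds the words first.
replicated=$(seq 1 3 | sed 's/.*/abc-&-def/' | xargs echo)

printf '%s\n' "$per_line" "$all_at_once" "$replicated"
```

The third case is where `%%` inside a word saves the most typing, since xargs needs a separate preprocessing step for it.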
+## multiple jobs in parallel
+
+When you run something like:
+
+ map -p 4 gzip *.pdf
+ # yes, it doesn't have to be of the form 'ls *.pdf | map -p 4 gzip'!
+
+you are running 4 jobs in parallel. This indicates that the job might be CPU
+bound (usually, though not always) so it's best to run each job on one input
+line rather than give it as many as it will take.
+
+So when you run in parallel mode, the default is `%` because that is what
+makes sense.
+
+## specifying maximum arguments per invocation
+
+However, if you use `-n`, (even if you are also using `-p`) the default
+switches back to `%%`. The logic is that specifying "maximum arguments per
+invocation" implicitly gives permission to actually *have* more than one
+argument, overriding the `-p` exception.
+
+So yeah this is an exception to an exception but I don't think it's too hard
+to remember.
+
+And if in doubt, you can always specify what you want explicitly, you know...
+
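The rule for the default replacement string can be written down as a tiny decision function. This is a restatement of the rule in this document, not code taken from `map`; the name `default_placeholder` and its yes/no arguments are invented for the sketch.

```shell
# Sketch of the default-placeholder rule described above.
# $1 is "yes" if -p was given, $2 is "yes" if -n was given.
default_placeholder() {
    if [ "$1" = yes ] && [ "$2" != yes ]; then
        echo '%'     # parallel mode alone: one input line per job
    else
        echo '%%'    # everything else, including -p together with -n
    fi
}

default_placeholder no  no     # prints %%
default_placeholder yes no     # prints %
default_placeholder yes yes    # prints %%
```

All of this applies only when the command contains no `%` of any kind; an explicit `%` or `%%` always wins.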
+## delimiter mode
+
+Here's an example; more documentation may follow if anyone asks but notice the
+delimiter character (colon) and the specification of field 1 and field 7:
+
+ cat /etc/passwd | egrep -v 'nologin|bash' | map -d=: echo %1 use %7 as shell
+
+The default delimiter is whitespace. For convenience, '-d=t' uses tabs.
+Anything else, like ':', is specified literally, like above.
+
+Here's another example: report users who have some shell as login but no GECOS
+field:
+
+ < /etc/passwd map -d=: -- '[[ %7 =~ sh ]] && [ -z "%5" ] && echo %1 || :'
+
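If delimiter mode looks unfamiliar, it may help to compare it with awk's field splitting. The sketch below reproduces the output of the two examples above with `awk -F:` on two made-up passwd-style records (the sample users are hypothetical, and the `egrep -v` filter from the first example is left out). Unlike `map`, awk only prints here; it does not run a command per line.

```shell
# Hypothetical passwd-style records, used instead of the real /etc/passwd.
sample='alice:x:1000:1000:Alice A:/home/alice:/bin/zsh
bob:x:1001:1001::/home/bob:/bin/sh'

# Field 1 and field 7, like: map -d=: echo %1 use %7 as shell
shells=$(printf '%s\n' "$sample" | awk -F: '{ print $1, "use", $7, "as shell" }')

# Users whose shell field contains "sh" but whose GECOS field (5) is empty
no_gecos=$(printf '%s\n' "$sample" | awk -F: '$7 ~ /sh/ && $5 == "" { print $1 }')

printf '%s\n' "$shells" "$no_gecos"
```

Here `%1` and `%7` play the role of awk's `$1` and `$7`, with the difference that `map` substitutes them into an arbitrary command.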
+## IMPORTANT NOTES
+
+**Filenames with unusual characters**
+
+Map **will** work fine with such filenames **except that** you cannot use
+parallel mode (since that invokes xargs). You also cannot use redirection or
+multiple commands unless you provide your own quoting.
+
+For example (assuming some command sending in a list of filenames), this will
+fail:
+
+ ... | map 'echo -n `gzip < % | wc -c`; echo -n '*100/'; wc -c < % ' | bc
+
+but this will succeed:
+
+ ... | map 'echo -n `gzip < "%" | wc -c`; echo -n '*100/'; wc -c < "%"' | bc
+
+(by the way, this computes the size of the gzipped file as a percentage of the
+original)
+
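Why the quotes matter can be shown without `map` at all. The sketch below assumes that the input line is pasted textually into a command string that a shell then runs, which is what the quoting advice above implies but this document does not spell out. The file name and temporary directory are made up for the demonstration.

```shell
# Hypothetical demo file whose name contains a space.
workdir=$(mktemp -d)
name="$workdir/my file.txt"
printf 'hello' > "$name"

# Pasted in unquoted, the name splits into two words and the redirect fails.
if sh -c "wc -c < $name" >/dev/null 2>&1; then
    unquoted=ok
else
    unquoted=failed
fi

# Pasted in inside double quotes, the same command works.
quoted=$(sh -c "wc -c < \"$name\"" | tr -d ' ')

echo "unquoted: $unquoted, quoted byte count: $quoted"
```

Double quotes cover spaces; a name that itself contains a double quote, a backquote or a `$` would still need more careful escaping.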
+**OTHER WARNINGS**
+
+ * you may need to use `--` to separate map's options from its command if
+ the command also has options. Map's option parsing is rather greedy (it's
+ on my todo list to fix this).
+
+ * never mix the 3 styles of replacement strings ('%' and its cousins, '%%'
+ and its cousins, and '%1', '%2', etc for delimiter mode). Odd things will
+ happen -- I don't check for sanity.
+
+ * parallel mode defaults to number of CPUs.
