Numeric Fu for the command line
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.

nfu: Numeric Fu for your shell

NOTE: nfu is unlikely to receive any more major updates, as I'm currently working on its successor ni.

nfu is a text data hub and transformation tool with a large set of composable functions and source/sink adapters. For example, if you wanted to do a map-side inner join between a PostgreSQL table, a CSV from the Internet, and stuff on HDFS and gather the results into a sorted/uniqued text file:

$ nfu sql:P@:'%*mytable' \
      -i0 @[ -F , ] \
      -H@::H. [ -i0 hdfsjoin:/path/to/hdfs/data ] ^gcf1. \
      -g \
  > output

# equivalent long version
$ nfu sql:Pdbname:'select * from mytable' \
      --index 0 @[ --fieldsplit , ] \
      --hadoop /tmp/temp-resharded-upload-path [ ] [ ] \
      --hadoop . [ --index 0 hdfsjoin:/path/to/hdfs/data ] \
                 [ --group --count --fields 1. ] \
      --group \
  > output

Then if you wanted to plot a cumulative histogram of the .metadata.size JSON field from the third column values, binned to the nearest 100:

$ nfu output -m 'jd(%2).metadata.size' -q100ocOs1f10p %l

# equivalent long version
$ nfu output --map 'json_decode($_[2]).metadata.size' \
             --quant 100 --order --count --rorder \
             --sum 1 --fields 10 --plot 'with lines'



MIT license as usual.

Options and stuff

If you invoke nfu with no arguments, it will give you the following summary:

usage: nfu [prefix-commands...] [input-files...] commands...
where each command is one of the following:

  -A|--aggregate  (1) <aggregator fn>
     --append     (1) <pseudofile; appends its contents to current stream>
  -a|--average    (0) -- window size (0 for full average) -- running average
  -b|--branch     (1) <branch (takes a pattern map)>
  -R|--buffer     (1) <creates a pseudofile from the data stream>
  -c|--count      (0) -- counts by first column value; like uniq -c
  -S|--delta      (0) -- value -> difference from last value
  -D|--drop       (0) -- number of records to drop
     --duplicate  (2) <two shell commands as separate arguments>
  -e|--each       (1) <template; executes with {} set to each value>
     --entropy    (0) -- running entropy of relative probabilities/frequencies
  -E|--every      (1) <n (returns every nth row)>
  -L|--exp        (0) -- optional base (default e)
  -f|--fields     (0) -- string of digits, each a zero-indexed column selector
  -F|--fieldsplit (1) <regexp to use for splitting>
     --fold       (1) <function that returns true when line should be folded>
  -g|--group      (0) -- sorts ascending, takes optional column list
  -H|--hadoop     (3) <hadoop streaming: outpath|.|@, mapper|:, reducer|:|_>
     --http       (1) <HTTP adapter for TCP server output>
  -i|--index      (2) <field index, unsorted pseudofile to join against>
  -I|--indexouter (2) <field index, unsorted pseudofile to join against>
  -z|--intify     (0) -- convert column to dense integers (linear space)
  -j|--join       (2) <field index, sorted pseudofile to join against>
  -J|--joinouter  (2) <field index, sorted pseudofile to join against>
  -k|--keep       (1) <row filter fn>
  -l|--log        (0) -- optional base (default e)
  -m|--map        (1) <row map fn>
     --mplot      (1) <gnuplot arguments per column, separated by ;>
  -N|--ntiles     (1) <takes N, produces ntiles of numbers>
  -n|--number     (0) -- prepends line number to each line
     --octave     (1) <pipe through octave; vector is called xs>
  -o|--order      (0) -- sorts ascending by general numeric value
     --partition  (2) <partition id fn, shell command (using {})>
     --pipe       (1) <shell command to pipe through>
  -p|--plot       (1) <gnuplot arguments>
  -M|--pmap       (1) <row map fn (executed multiple times in parallel)>
  -P|--poll       (2) <interval in seconds, command whose output to collect>
     --prepend    (1) <pseudofile; prepends its contents to current stream>
     --preview    (0) 
  -q|--quant      (1) <number to round to>
  -r|--read       (0) -- reads pseudofiles from the data stream
  -K|--remove     (1) <inverted row filter fn>
     --repeat     (2) <repeat count, pseudofile to repeat>
  -G|--rgroup     (0) -- sorts descending, takes optional column list
  -O|--rorder     (0) -- sorts descending by general numeric value
     --sample     (1) <row selection probability in [0, 1]>
     --sd         (0) -- running standard deviation
     --splot      (1) <gnuplot arguments>
  -Q|--sql        (3) <create/query SQL table: db[:[+]table], schema|_, query|_>
  -s|--sum        (0) -- value -> total += value
  -T|--take       (0) -- n to take first n, +n to take last n
     --tcp        (1) <TCP server (emits fifo filenames)>
     --tee        (1) <shell command; duplicates data to stdin of command>
  -C|--uncount    (0) -- the opposite of --count; repeats each row N times
  -V|--variance   (0) -- running variance
  -w|--with       (1) <pseudofile to join column-wise onto input>

and prefix commands are:

  documentation (not used with normal commands):
    --explain           <other-options>
    --expand-pseudofile <filename>
    --expand-code       <code>
    --expand-gnuplot    <gnuplot options>
    --expand-sql        <sql>

  pipeline modifiers:
    --quote     -- quotes args: eval $(nfu --quote ...)
    --use       <>
    --run       <perl code>

argument bracket preprocessing:

  ^stuff -> [ -stuff ]

   [ ]    nfu as function: [ -gc ]     == "$(nfu --quote -gc)"
  @[ ]    nfu as data:    @[ -gc foo ] == sh:"$(nfu --quote -gc foo)"
  q[ ]    quote things:   q[ foo bar ] == "foo bar"

pseudofile patterns:

  file.bz2       decompress file with bzip2 -dc
  file.gz        decompress file with gzip -dc
  file.lzo       decompress file with lzop -dc
  file.xz        decompress file with xz -dc
  hdfs:path      read HDFS file(s) with hadoop fs -text
  hdfsjoin:path  mapside join pseudofile (a subset of hdfs:path)
  http[s]://url  retrieve url with curl
  id:X           verbatim text X
  n:number       numbers from 1 to n, inclusive
  perl:expr      perl -e 'print "$_\n" for (expr)'
  s3://url       access S3 using s3cmd
  sh:stuff       run sh -c "stuff", take stdout
  sql:db:query   results of query as TSV
  user@host:x    remote data access (x can be a pseudofile)

gnuplot expansions:

  %d -> ' with dots'
  %i -> ' with impulses'
  %l -> ' with lines'
  %p -> ' lc palette '
  %t -> ' title '
  %u -> ' using '
  %v -> ' with vectors '

SQL expansions:

  %\* -> ' select * from '
  %c -> ' select count(1) from '
  %d -> ' select distinct * from '
  %g -> ' group by '
  %j -> ' inner join '
  %l -> ' outer left join '
  %r -> ' outer right join '
  %w -> ' where '

database prefixes:

  P = PostgreSQL
  S = SQLite 3

environment variables:

  NFU_ALWAYS_VERBOSE    if set, nfu will be verbose all the time
  NFU_HADOOP_COMMAND    hadoop executable; e.g. hadoop jar, hadoop fs -ls
  NFU_HADOOP_OPTIONS    -D options for hadoop streaming jobs
  NFU_HADOOP_STREAMING  absolute location of hadoop-streaming.jar
  NFU_HADOOP_TMPDIR     default /tmp; temp dir for hadoop uploads
  NFU_MAX_FILEHANDLES   default 64; maximum #subprocesses for --partition
  NFU_NO_PAGER          if set, nfu will not use "less" to preview stdout
  NFU_PMAP_PARALLELISM  number of subprocesses for -M
  NFU_SORT_BUFFER       default 256M; size of in-memory sort for -g and -o
  NFU_SORT_COMPRESS     default none; compression program for sort tempfiles
  NFU_SORT_PARALLEL     default 4; number of concurrent sorts to run

see for documentation